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Abstract 

In high-dimensional linear regression, the goal pursued here is to estimate an unknown 
regression function using linear combinations of a suitable set of covariates. One of the key 
assumptions for the success of any statistical procedure in this setup is to assume that the 
linear combination is sparse in some sense, for example, that it involves only few covariates. 
We consider a general, non necessarily linear, regression with Gaussian noise and study a 
related question that is to find a linear combination of approximating functions, which is at 
the same time sparse and has small mean squared error (MSE) . We introduce a new estima- 
tion procedure, called Exponential Screening that shows remarkable adaptation properties. 
It adapts to the linear combination that optimally balances MSE and sparsity, whether the 
latter is measured in terms of the number of non-zero entries in the combination (£q norm) 
or in terms of the global weight of the combination {l\ norm). The power of this adaptation 
result is illustrated by showing that Exponential Screening solves optimally and simultane- 
ously all the problems of aggregation in Gaussian regression that have been discussed in 
the literature. Moreover, we show that the performance of the Exponential Screening es- 
timator cannot be improved in a minimax sense, even if the optimal sparsity is known in 
advance. The theoretical and numerical superiority of Exponential Screening compared to 
state-of-the-art sparse procedures is also discussed. 

Mathematics Subject Classifications: Primary 62G08, Secondary 62G05, 62J05, 62C20, 
62G20. 

Key Words: High-dimensional regression, aggregation, adaptation, sparsity, sparsity oracle 
inequalities, minimax rates, Lasso, BIC. 

1 Introduction 

The theory of estimation in high-dimensional statistical models under the sparsity scenario has 
been considerably developed during the recent years. One of the main achievements was to 
derive sparsity oracle inequalities (SOI), i.e., bounds on the risk of various sparse estimation 
procedures in terms of the £q norm (number of non-zero components) of the estimated vectors 
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or their approximations (see Bickel et al., 2009; Bunea et al., 2007a, b; Candes and Tao, 2007; 
Koltchinskii, 2008, 2009a,b; van de Geer, 2008; Zhang and Huang, 2008; Zhang, 2009, and ref- 
erences therein). The main message of these results was to demonstrate that if the number 
of non-zero components of a high-dimensional target vector is small, then it can be reason- 
ably well estimated even when the ambient dimension is larger than the sample size. However, 
there was relatively few discussion of the optimality of these bounds, mainly based on specific 
counter-examples or referring to the paper by Donoho and Johnstone (1994a), which treats the 
Gaussian sequence model. The latter approach is, in general, insufficient as we will show below. 
An interesting point related to the optimality issue is that some of the bounds in the papers 
mentioned above involve not only the £q norm but also the l\ norm of the target vector, which 
is yet another characteristic of sparsity. Thus, a natural question is whether the t\ norm plays 
an intrinsic role in the SOI or it appears there due to the techniques employed in the proof. 

In this paper, considering the regression model with fixed design, we will show that the role 
of t\ norm is indeed intrinsic. Once we have a "rather general SOI" in terms the £o norm, a SOI 
in terms of the l\ norm follows as a consequence. This means that we can write the resulting 
bound with the rate which is equal to the minimum of the £q and l\ rates (see Theorem 3.2). 
Unfortunately, the above mentioned "rather general SOI" is not available in the literature for the 
previously known sparse estimation procedures. We therefore suggest a new procedure called 
the Exponential Screening (es), which satisfies the desired bound. It is based on exponentially 
weighted aggregation of least squares estimators with suitably chosen prior. The idea of using ex- 
ponentially weighted aggregation for sparse estimation is due to Dalalyan and Tsybakov (2007). 
Dalalyan and Tsybakov (2007, 2008, 2009, 2010) suggested several procedures of this kind based 
on continuous sparsity priors. Our approach is different because we use a discrete prior in the 
spirit of earlier work by George (1986a, b); Leung and Barron (2006); Giraud (2008). Unlike 
George (1986a, b); Leung and Barron (2006); Giraud (2008), we focus on high-dimensional mod- 
els and treat explicitly the sparsity issue. Because of the high dimensionality of the problem, we 
need efficient computational algorithms, and therefore we suggest a version of the Metropolis- 
Hastings algorithm to approximate our estimators (subsection 7.1). Regarding the sparsity issue, 
we prove that our method benefits simultaneously from three types of sparsity. The first one is 
expressed by the small rank of the design matrix X, the second by the small number of non-zero 
components of the target vector, and the third by its small l\ norm. Finally, we mention that in 
a work parallel to ours, Alquier and Lounici (2010) consider exponentially weighted aggregates 
with priors involving both discrete and continuous components and suggest another version of 
the Metropolis-Hastings algorithm to compute them. 

The contributions of this paper are the following: 

(i) We propose the ES estimator which benefits simultaneously from the above mentioned three 
types of sparsity. This follows from the oracle inequalities that we prove in Section 3. We 
also provide an efficient and fast algorithm to approximately compute the ES estimator 
and show that it outperforms several other competitive estimators in a simulation study. 

(ii) We show that the ES estimator attains the optimal rate of sparse estimation. To this end, 
we establish a minimax lower bound which coincides with the upper bound on the risk of 
the ES estimator on the intersection of the £q and l\ balls (Theorem 5.3). 
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(iii) As a consequence, we find optimal rates of aggregation for the regression model with fixed 
design. We consider the five main types of aggregation, which are the linear, convex, model 
selection, subset selection and D-convex aggregation, cf. Nemirovski (2000); Tsybakov 
(2003); Bunea et al. (2007b); Lounici (2007). We show that the optimal rates are different 
from those for the regression model with random design established in Tsybakov (2003). 
Indeed, they turn out to be moderated by the rank of the regression matrix X. The rates 
are faster for the smaller ranks. See Section 6. 

This paper is organized as follows. After setting the problem and the notation in Section 2, 
we introduce the ES estimator in Section 3 and prove that it satisfies a SOI with a remainder term 
obtained as the minimum of the £o and the l\ rate. This result holds with no assumption on the 
design matrix X, except for simple normalization. We put it into perspective in Section 4 where 
we compare it with weaker SOI for the bic and the Lasso estimators. In Sections 5.1 and 5.2 
we discuss the optimality of SOI. In particular, Section 5.1 comments why a minimax result 
in Donoho and Johnstone (1994a) with normalization depending on the unknown parameter is 
not suitable to treat optimality. Instead, we propose to consider minimax optimality on the 
intersection of £q and t\ balls. In Section 5.2 we prove the corresponding minimax lower bound 
for all estimators and show rate optimality of the ES estimator in this sense. Section 6 discusses 
corollaries of our main results for the problem of aggregation; we show that the ES estimator 
solves simultaneously and optimally the five problems of aggregation mentioned in (iii) above. 
Finally, Section 7 presents a simulation study demonstrating a good performance of the ES 
estimator in numerical experiments. 

2 Model and notation 

Let Z := {(x\, Yi), . . . , (x n , Y n )} be a collection of independent random couples such that 
(xi,Yi) £ X x IR, where X is an arbitrary set. Assume the regression model: 

Yi = r](xi) + i = l,...,n, 

where rj : X — > IR is the unknown regression function and the errors £j are independent Gaussian 
A^(0, a 2 ). The covariates are deterministic elements x\, ... ,x n of X. Consider the equivalence 
relation ~ on the space of functions / : X — > IR such that / ~ g if and only if /(xj) = g(x{) for 
all i = 1, . . . , n. Denote by Qi- n the quotient space associated to this equivalence relation and 
define the norm || • || by 

n 

ll/H 2 :=-E/ 2 ^)' f£Ql:n- 
n 

i=l 

Notice that || • || is a norm on the quotient space but only a seminorm on the whole space of 
functions / : X — >■ JR. Hereafter, we refer to it as a norm. We also define the associated inner 
product 

1 - 

(f,9) ■= - y2f(xi)g(xi) . 

8=1 
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Let H := {fi, ■ ■ ■ ,/m}, be a dictionary of M > 1 given functions fj:X—> JR. We approx- 
imate the regression function jj by a linear combination fg(x) = ^2j = i6jfj(x) with weights 
9 = (0i, ... , 6m), where possibly M 3> n. 

We denote by X, the n x M design matrix with elements Xjj = fj(xi), i = 1, . . . ,n, j = 
1, . . . , M. We also introduce the column vectors f = (rj(xi), . . . , r](x n )) T , Y = (Yi, . . . , Y n ) T and 
£ = • • • i^n) T - Let | • | p denote the i v norm in ]R d for p,d>l and M (0) denote the £ norm 
of 6 G IR M , i.e., the number of non-zero elements of 6 G IR M . For two real numbers a and 6 we 
use the notation a A b := min(a, 6), a V 6 := max(a, b); we denote by [a] the integer part of a and 
by [a] the smallest integer greater than or equal to a. 



3 Sparsity pattern aggregation and Exponential Screening 

A sparsity pattern is a binary vector p G V := {0,1} M . The terminology comes from the fact 
that the coordinates of any such vectors can be interpreted as indicators of presence (pj = 1) or 
absence (pj = 0) of a given feature indexed by j G {1, . . . , M}. We denote by |p| the number of 
ones in the sparsity pattern p and by IR P the space defined by 

1R P = {6 ■ p : 6 G M A/ } C K M , 

where 6 ■ p G JR M denotes the Hadamard product between 6 and p and is defined as the vector 
{6-p) j = 6 j p j ,j = l...^M. 

For any p G V, let 6 p be any least squares estimator defined by 

6 P G argmin | Y — X0|| , ( 3-1 ) 
eeRp 

The following simple lemma gives an oracle inequality for the least squares estimator. Let 
rk(X) < M An denote the rank of the design matrix X. 

Lemma 3.1 Fix p G V . Then any least squares estimator 6 P defined in (3.1) satisfies 

JEllfs - r?|| 2 = min \\f e - r?|| 2 + a 2 ^- < min ||f - nil 2 + a 2 ^ AR (3.2) 
y p eeRp n 6»eRp n 

where R p is the dimension of the linear subspace {X.6 : 6 G 1R P } and R = rk(X). Moreover, the 
random variables ,£ n need not be Gaussian for (3.2) to hold. 

Proof of the lemma is straightforward in view of the Pythagorean theorem. 

Let 7T = (vrp)p be a probability measure on V, which we will further call a prior. The sparsity 
pattern aggregate (spa) estimator is defined as fg S PA, where 

£ §p exp ( - h - V 2 ^) 2 - t)*> 

«spa i=1 



1 

£ ex p(-4^£( y *-V^)) 2 -f 

per i=l 
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As shown in Leung and Barron (2006), the following oracle inequality holds: 

4q 2 log (^ p 1 ) 
peP:7Tp^o l " w p '" n 

Now, we consider a specific choice of the prior tt 

IpI 



„9 f „ „9 4fJ 2 log(7T_ X ) 1 

E f,- SPA - r? 2 < min J E f, - 77 2 + ^-E_l L 3.3 



±, if|p| = M, (3-4) 

, otherwise , 



where R = rk(X), we use the convention 0° = 1 and H = 2 X^=o (if) {iZm) ls a normalization 
factor. In this paper we study the SPA estimator with the prior defined in (3.4). We call it the 
Exponential Screening (es) estimator, and denote by 9 ES the estimator 6 SPA with the prior (3.4). 
The ES estimator is a mixture of least squares estimators corresponding essentially to sparsity 
patterns p with small size and small residual sum of squares. Note that the weight 1/2 is assigned 
to the least squares estimator on the whole space (case where |p| = M) and can be changed 
to any other constant in (0, 1) without modifying the rates presented below, as long as H is 
modified accordingly. 

Since (^f) < (^) , we obtain that H < 4. Using this and considering separately the cases 
|p| < 1 and |p| > 2, we obtain that the remainder term in (3.3) satisfies 



4a 2 logfTTp 1 ) 4a 2 



IpI log ( ip| v ! ) +1 °s 4 



n n 

8cr 2 |p| , / eM \ 8a 2 , 
< — log 1 + tt— ) + log 2 



(3.5) 



n V |p| V 1 / n 

for sparsity patterns p such that |p| < R. Together with (3.3), this inequality yields the following 
theorem. 

Theorem 3.1 For any M > l,n > 1, the Exponential Screening estimator satisfies the follow- 
ing sparsity oracle inequality 

2 . f 2 a 2 R 9a 2 M(6) ( eM \\ 8a 2 



E||f - ES - V \\ 2 < min ||f e - rj\\ + A ^ log 1+ - - + log 2 (3.6) 

9&r m I n n \ M[p) V 1/ J n 

where R < M An denotes the rank of the design matrix X. 

Proof. Combining the result of Lemma 3.1 and (3.3) with the sparsity prior defined in (3.4), 
we obtain that E||fgf ES — n\\ 2 is bounded from above by 

,2 , n 2^)i„.A , eM \\ , 8a 2 



M(6)<R 

and by 



min ||fe - V\\ 2 + 9(7'—^ log 1 + + l °Z 2 > (3-7) 

eeR M { n \ M[d) V 1/ J n 



min (||f -??|| 2 +cT 2 -l + — log 2. (3.8) 
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Combining (3.7) and (3.8) concludes the proof. 



An interesting corollary of Theorem 3.1 is obtained for the linear regression model where it 
is assumed that r\ = f#* for some 9* G IR M . In this case (3.6) yields 

%s-fo« f< A L^tog 1+ + log 2. 

" n n \ M(9*) V 1 y n 

However, even in this parametric case, Theorem 3.1 provides a stronger result. Indeed, if there 
exists 9' G IR M , such that 

r ll2 <J 2 R 9a 2 M(9 r ) , / eM \ cr 2 i? 9ct 2 M(0*) , / eM 
||fe' - ffl. II 2 + A ^ log 1+ TT7EKTT- i < A ^ log 1 + 



n n V ^(^') v 1/ " n \ M i°*) v 1 

(3.9) 

then Theorem 3.1 gives a tighter bound on E||fg ES — fg* || 2 . A vector 9' G IR M that satisfies (3.9) 
exists when fg* can be well approximated by fgi and 9' is much sparser than 9* . 

While the sparsity oracle inequality (3.6) indicates that the ES estimator adapts to the 
underlying sparsity when measured in terms of the number of non-zero coefficients M{6), it is 
also adaptive to the sparsity when measured in terms of the l\ norm |0|i = ^ ■ \9\j. This can 
become an advantage when 9 has many small coefficients so that |0|i <C M{9). Indeed, the 
following theorem shows that the ES estimator also enjoys adaptation in terms of its l\ norm. 

Theorem 3.2 Assume that maxi<j<Af ||/j|| < 1. Then for any M > l,n > 1 the Exponential 
Screening estimator satisfies 

2 

E||f - ES -r]\\ 2 < min {\\f e ~ V\\ 2 + Vu,m(°)} + — (91og(l + eM) + 81og2). (3.10) 
een M n 

where ^ n ,Af(0) := and, for 9^0, 



a 2 R 9a 2 M(9), / eM \ lXcrl^N / / 3eMa \ 

VnM0) _ _ a — -ii Log (l + ) A -jU^g (l + ijj^) ■ (3-11) 

Furthermore, for any 9 G H M , such that (fe,v) — 1 1 fell 2 j 1,76 ^awe 

8cr 2 

E||f& s - f?|| 2 < ||fe-r ? || 2 + ^ n ,MW + log 2 (3.12) 



where VVi,Af(0) := and, /or 0^0, 



o- 2 i? 9c 2 M(^) i / eM \ UaWU I ( 3eMa\ .„,» . 

== — A — ^ log (l + jj^; ) A -^Ji^lOg (l + |^=) A 4|«|f . (3.13) 

In particular, if there exists 9* G IR M such that r\ = f#* , we /iaue 

E||f,- ES - f*. || 2 < 1>nM *) + — log 2 . (3.14) 
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The proof of Theorem 3.2 is obtained by combining Theorem 3.1 and Lemma 8.2 in the appendix. 
For brevity, the constants derived from Lemma 8.2 are rounded up to the closest integer. 

It is easy to see that in fact Lemma 8.2 implies a more general result. Not necessarily f^ ES 
but, in general, any estimator satisfying a SOI of the type (3.6) also obeys the oracle inequality 
of the form (3.10), i.e., enjoys adaptation simultaneously in terms of the £q and l\ norms. This 
remains still a theoretical proposal, since we are not aware of estimators satisfying (3.6) apart 
from fs BS . However, there are estimators for which coarser versions of (3.6) are available as 
discussed in the next section. 

4 Sparsity oracle inequalities for the BIC and Lasso estimators 

The aim of this section is to put Theorem 3.2 in perspective by discussing weaker results in the 
same spirit for two popular estimators, namely, the BIC and the Lasso estimators. 
We consider the following version of the BIC estimator, cf. Bunea et al. (2007b): 

6 mc e argmin { -|Y - X#|i + pen(#) 1 , (4.15) 

where 

P en(0) := ^ (l + y/ltf) + ^L(0)) M(9) , 
n y 1 + a a J 

with for some a > and L(8) = 2 log ( m(8)vi ) • Combining Theorem 3.1 in Bunea et al. (2007b) 
and Lemma 8.2 in the appendix we get the following corollary. 

Corollary 4.1 Assume that maxi<j<M \\fj\\ < 1. Then there exists a positive numerical con- 
stant C such that for any M > 2, n > 1 and any a > the BIC estimator satisfies 

E||f fl - Btt . -r?|| 2 < (l + o) min ( ||f - r?|| 2 + C^L Vn , M {8)\ + — , (4.16) 

where (p n) M is defined in (3.11). 

We note that Theorem 3.1 in Bunea et al. (2007b) is stated with R = M and with the additional 
assumption that all the functions fj are uniformly bounded. Nevertheless, this last condition is 
not used in the proof in Bunea et al. (2007b), and the result trivially extends to the framework 
that we consider here. The SOI (4.16) ensures adaptation to sparsity simultaneously in terms 
of the £q and l\ norms. However, it is less precise than the SOI in Theorem 3.2 because the 
leading constant (1 + a) is strictly greater than 1 and the rate deteriorates as the leading constant 
approaches 1, i.e., as a — > 0. Also the computation of the BIC estimator is a hard combinatorial 
problem, exponential in M, and it can be efficiently solved only when the dimension M is small. 
Consider now the Lasso estimator 9 L , i.e., a solution of the minimization problem 

9 L 6 argmin i-|Y - X0|| + A|0|i) , (4.17) 
6]R m [ n J 
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where A > is a tuning parameter. This problem is convex, and there exist several efficient 
algorithms of computing 9 L in polynomial time. 

Our aim here is to present results in the spirit of Theorem 3.2 for the Lasso. They have 
a weaker form than for the ES estimator and for the BIC. In the next theorem, we give a 
SOI in terms of the l\ norm that is similar to those that we have presented for the ES and BIC 
estimators but it is stated in probability rather than in expectation and the logarithmic factor in 
the rate is less accurate. Note that it does not require any restrictive condition on the dictionary 
fl, ■ ■ ■ , !m- 

Theorem 4.1 Assume that m&xi<j<M \\fj\\ < 1- Let M > 2,n > 1 and let 9 L be the Lasso 



estimator defined by (4.17) with A = Aay ^ 1 , where A > 2^/2. Then with probability at least 
1 - M 1_A2 / 8 we have 

||f^-??|| 2 < min - n\\ 2 + 2Aa^^\ogM \ . (4.18) 
Proof. From the definition of L by a simple algebra we get 



\^-r]\\ 2 < \\fe-v\\ 2 + - (0 L -0)'X^ 
u n 



+ A(|0|i-|0 L |i), V0GIR W 



Next, note that P{A) > 1 - M 1 ^ 2 / 8 for the random event A = {j^X 1 ^^ < a| (cf. 
Bickel et al., 2009, eq. (B.4)). Therefore, 

IIV - V\\ 2 < life - r?|| 2 + X\6 L - 9^ + A(|0|i - |0 L |i), V 6 G M M , 

with probability at least 1 — M 1_a2//8 . Thus, (4.18) follows by the triangle inequality and the 
definition of A. | 

The rate J^VlogM in (4.18) is slightly worse than the corresponding i\ term of the rate of 
ES estimator, cf. (3.11) and (3.13). 

In contrast to Theorem 4.1, a SOI in terms of the £q norm for the Lasso is available only 
under strong conditions on the dictionary fi, . . . , fu- Following Bickel et al. (2009), we say that 
the restricted eigenvalue condition RE(s,co) is satisfied for some integer s such that 1 < s < M, 
and a positive number cq if we have: 

l XA b 

K(s,co) := min min =. — — > 0. 

J c{i,...,M}, A^o, Vrl| A Jo | 2 

, , |A Tc| 1 <c |Aj n |i 

Here \J\ is the cardinality of the index set J and we denote by Aj the vector in 1R M that has 
the same coordinates as A on J and zero coordinates on the complement J c of J. A typical 
SOI in terms of the £q norm for the Lasso is given in Theorem 6.1 of Bickel et al. (2009). It 
guarantees that, under the condition RE(s,3 + 4/a) and the assumptions of Theorem 4.1, with 
probability at least 1 — M l ~ A2 ^ , we have 

i.2 / s f„r „ 2 C(l + a) M(6)logM) 

f§,-V 2 < (1 + a) min <^ f e - V + 2( Q , h ^ ■ 4.19 

eeR M :M(e)<s { aK 2 (s,3 + 4/a) n J 
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for all a > and some constant C > depending only on A and a. This oracle inequality is 
substantially weaker than (3.10) and (4.16). Indeed, it is valid under assumption RE(s,3 + 4/a), 
which is a strong condition. Furthermore, the rank of the matrix X does not appear, the 
minimum in (4.19) is taken over the set of sparsity s linked to the properties of the matrix X, 
and the minimal restricted eigenvalue k(s, 3 + 4/a) appears in the denominator. This contrasts 
with inequalities (3.10), (4.16) and (4.18) which hold under no assumption on X, except for 
simple normalization: m&xi<j<M \\fj\\ < 1- Finally, the leading constant in (4.19) is strictly 
larger than 1, and the same comments as for the Bic apply in this respect. 



5 Discussion of the optimality 

5.1 Deficiency of the approach based on function normalization 

Section 3 provides upper bounds on the risk of ES estimator. A natural question is whether 
these bounds are optimal. At first sight, to show the optimality it seems sufficient to prove that 
there exists 9 E IR M and tj such that, for any estimator T, 

1EIIT-7?!! 2 > \\f* - vf + a/t^M^) , 

where c > is some constant independent of n and M. This can be also written in the form 

Ellr-r/H 2 - llfe-7/H 2 
mf sup sup - j— > c (5.1) 

T V 0eR M Vn,M\P) 

where inff denotes the infimum over all estimators. We note that it is possible to prove (5.1) 
under some assumptions on the dictionary /i, fu- However, we do not consider this type of 
results because they do not lead a valid notion of optimality. Indeed, since the rate ijjn,M{9) 
is a function of parameter 9, there exists infinitely many different rate functions ip n M{') f° r 
which (5.1) can be proved and complemented by the corresponding upper bounds. To illustrate 
this point, consider a basic example defined by the following conditions: 

(i) M = n, 

(ii) r] = fff* for some 9* G R n , 

(iii) the Gram matrix \E' = X T X/n is equal to the n x n identity matrix, 

(iv) a 2 = 1. 

This will be further referred to as the diagonal model. It can be equivalently written as a 
Gaussian sequence model 

y i = 9 i + — F =£ i , i = l,...,n, (5.2) 



where (yi, . . . , y n ) T = X T Y/n and ei, . . . , e n are i.i.d. standard Gaussian random variables. 

Clearly, estimation of rj in the diagonal model is equivalent to estimation of 6* in model (5.2), 
and we have the isometry \\fg — rj\\ = \9 — 9*\2- Moreover, it is easy to see that we can consider 
w.l.o.g. only estimators T of the form T = f & for some statistic 9, and that (5.1) for the diagonal 
model follows from a simplified bound 

■ f Eg\9 - 9\l , , 

mi sup ■ — — — > c, (5.3) 



if> n (P) 



9 



where we write Eg to specify the dependence of the expectation upon 9, infg denotes the infimum 
over all estimators, and for brevity ip n {0) = ipn,n(@)- 

Results of the type (5.3) are available in Donoho and Johnstone (1994a) where it is proved 
that, for the diagonal model, 

infsup ^WW = l+o{11 (5 ' 4) 



e 6eR n 



as n — > oo, where 



^(9) = 21ogn<! - + > min ( 9(, -)}. (5.5) 



f 1 / 1 

2 log n{ — h 7 min ( 9f, — 
I n ^-^ \ n 

v i=l 



The expression in curly brackets in (5.5) is the risk of 0-1 (or "keep-or-kill" ) oracle, i.e., the 
minimal risk of the estimators 9 whose components 9j are either equal to yj or to 0. A re- 
lation similar to (5.4), with the infimum taken over a class of thresholding rules, is proved 
in Foster and George (1994). 

The result (5.4) is often wrongly interpreted as the fact that the factor 21ogn is the "un- 
avoidable" price to pay for sparse estimation. In reality this is not true, and (5.4) cannot be 
considered as a basis of valid notion of optimality. Indeed, using the results of Section 3, we are 
going to construct an estimator whose risk is 0(tp^ l 1 (9)) for all 9, and is of order o(tp^ l 1 (9)) for 
some 9, cf. Theorem 5.2 below. So, this estimator improves upon (5.4) not only in constants but 
in the rate; in particular, the exact asymptotic constant appearing in (5.4) is of no importance. 
The reason is that the lower bound for (5.4) in Donoho and Johnstone (1994a) is proved by 
restricting 9 to a small subset of IR n , and the behavior of the risk on other subsets of ]R n can 
be much better. 

Define the rate 

i 1 



= min 



M(0)logn / logn 



+ 

n 



n v n 

which is an asymptotic upper bound on the rate in (3.13) for M = n, n — > oo. 

Theorem 5.1 Consider the diagonal model. Then the Exponential Screening estimator satisfies 



and 



iimsup sup 7mTn\ — ^> l^.b) 



rwoo 6»eR n 

Furthermore, 



rp l/QES n\2 

liminf inf 91 — 12 = 0. (5.7) 



lim inf = 0. (5.8) 
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Proof. We first prove (5.6). From (3.3), Lemma 3.1 and (3.5) we obtain 

E e \e ES - < min {\9 - 9*\ 2 2 + ^(1 + 41og(2en))) + . 

for any 0* G R n . Let G IR™ be the vector with components 9j = 9*I(\9*\ > l/y/n) where /(•) 
denotes the indicator function. Then 

n 

\e-e*\ 2 2 = ^2\e*\ 2 i(\9*\<i/v^), 

and 

M{9) 



1 

£-I(|0J|>l/V^). 
n n 
i=i 

Therefore, 



£ |0 ES - 9*\ 2 < (1 + 41og(2en)) ^min (jfl*| 2 , + 



l\ 41or2 



n 



which implies (5.6). Next, (5.7) is an immediate consequence of (5.8). To prove (5.8) we consider, 
for example, the set Q n = j# G IR" : a/y/n < \9j\ < b/y/n for all 8j / o| where < a < b < oo 
are constants. For all 6 G Q n we have 



<W < Wl ^ + I<U, W 3^» + I, 
1 n n n n 



and 



so that 



V'° 1 (^)>2(min(a 2 ,l)M(0) + l 



log n 



lim sup = 0. (5.9) 



Hence, (5.8) follows. 



Theorem 5.1 shows that the normalizing function (rate) tp^(9) and the result (5.4) cannot be 
considered as a benchmark. Indeed, the risk of the ES estimator is strictly below this bound. It 
attains the rate tp^(8) everywhere on IR™ (cf. (5.6)) and has strictly better rate on some subsets 
of IR n (cf. (5.8), (5.9)). In particular, the ES estimator improves upon the soft thresholding 
estimator, which is known to asymptotically attain the bound (5.4) (cf. Donoho and Johnstone, 
1994a). This is a kind of inadmissibility statement for the rate ^ 1 (9). 

Observe also that the improvement that we obtain is not a "marginal" effect regarding 
signals 9 with small intensity. Indeed, (5.9) is stronger than (5.8) and the set O n is rather 
massive. In particular, the £q norm M{9) in the definition of Q n can be arbitrary, so that Q n 
contains elements 9 with the whole spectrum of t\ norms, from small = a/y/n to very large 
\9\\ = bM/y/n = by/n. Various other examples of n satisfying (5.9) can be readily constructed. 
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So far, we were interested only in the rates. The fact that the constant in (5.6) is equal 
to 2 was of no importance in this argument since on some subsets of IR n we can improve the 
rate. Notice that one can construct estimators having the same properties as those proved for 
qes j n Theorem 5.1 with constant 1 instead of 2 in (5.6). In other words, one can construct an 
estimator 9* whose risk is at least as small as ip^{9)(l + o(l)) everywhere on IR n and attains 
strictly faster rate o(ip^(Q)) on some subsets of IR™. Such an estimator 9* can be obtained by 
aggregating 9 ES with the soft thresholding estimator, as shown in the next theorem. 

Theorem 5.2 Consider the diagonal model. Then there exists a randomized estimator 6* such 
that 

TP \Q* Q\2 

limsup sup = < 1, (5.10) 

where the expectation includes that over the randomizing distribution, and where the normalizing 
functions ip n satisfy 

liminf inf > 1, (5.11) 

n->00 eR n ^(0) 



and 



ib 01 (9) 

liminf sup = oo. (5.12) 

1pn(9) 



n— >oo 



The proof of this theorem is given in the appendix. 



5.2 Minimax optimality on the intersection of £q and £\ balls 

The rate in the upper bound of Theorem 3.2 is the minimum of terms depending on the Iq norm 
M{9) and on the t\ norm cf. (3.13). We would like to derive a corresponding lower bound, 
i.e., to show that this rate of convergence cannot be improved in a minimax sense. Since both £q 
and l\ norms are present in the upper bound, a natural approach is to consider minimax lower 
bounds on the intersection of £q and l\ balls. Here we prove such a lower bound under some 
assumptions on the dictionary % = {/i, . . . , /a/} or, equivalently, on the matrix X. Along with 
the lower bound for one "worst case" dictionary T~L, we also state it uniformly for all dictionaries 
in a certain class. 



5.2.1 Assumptions on the dictionary 

Recall first, that all the results from Section 3 hold under the only condition that the dictionary 
% is composed of functions fj such that ||/,-|| < 1. This condition is very mild compared to 
the assumptions that typically appear in the literature on sparse recovery using t\ penalization 
such as the Lasso or the Dantzig selector. Buhlmann and van de Geer (2009) review a long 
list of such assumptions, including the restricted isometry (RI) property given, for example, in 
Candes (2008) and the restricted eigenvalue (RE) condition of Bickel et al. (2009) described in 
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Section 4. We call them for brevity the L-conditions. Loosely speaking, they ensure that for 
some integer S < M, the design matrix X forms a quasi-isometry from a suitable subset A p 
of IR P into IR n for any p such that |p| < S. Here "quasi-isometry" means that there exist two 
positive constants k and R such that 

„|0|| < L-Jl < R \0g } V0ei p . (5.13) 

While the general thinking is that a design matrix X satisfying an L-condition is favorable, 
we establish below that, somewhat surprisingly, such matrices correspond to the least favorable 
case. 

We now formulate a weak version of the RI condition. For any integer M > 2 and any 
< u < M let V u denote the set of vectors 9 £ {-1,0, 1} M such that M{9) < u. For any 
constants k > 1 and < t < (M A n)/2 let T>(t, k) be the class of design matrices X defined by 
the conditions: 

(i) max II/,- II < 1, 

(ii) there exist k, k > 0, such that k/R > k and 

I Yfl|2 

k\9\ 2 2 <- -<R\9\l V9eV 2 t- (5.14) 

n 

Note that t < t' implies T>(t' , k) < T>(t, k). Examples of matrices X that satisfy (5.14) are given 
in the next subsection. 

In the next subsection we show that the upper bound of Theorem 3.2 matches a minimax 
lower bound which holds uniformly over the class of design matrices T>(S, k). 



5.2.2 Minimax lower bound 

Denote by the distribution of (Yj., . . . , Y n ) where Yi = rj{xi) + £j, i = 1, . . . , n, and by E v the 
corresponding expectation. For any S > and any integers S > l,n > 1,M > 1,R> 1 such 
that R < M A n, define the quantity 



Cn,M,R(S^) :=^-A — log(l + — J A-^=Wlog(^l + -^J A5 2 . (5.15) 



n n 



Note that (n,M,R,(S, 5) = VVi,m(#) where V'n.M is the function (3. 13) with M(9) = S and |6*|i = 5. 
Let m > 1 be the largest integer satisfying 

m< - 6V ™ =, (5.16) 



if such an integer exists. If there is no m > 1 such that (5.16) holds, we set m = 0. Note that 
m < 5y/n/a. 
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Theorem 5.3 Fix 5 > and integers n > 1, M > 2, 1 < S < M. Fix ft > and Ze£ H be any 
dictionary with design matrix X G P(S* A rh, k), where ra = mVl and m is defined in (5.16). 
Then, for any estimator T n , possibly depending on S,S,n,M and T~L, there exists a numerical 
constant c* > 0, such that 

sup sup{^||T n - r/|| 2 - ||f e - r/|| 2 } > c*K( ntMMX) (S,5) , (5.17) 
eeRf\{o} v 
M(e)<s 

\0\i<6 

where rk(X) denotes the rank of X and TR^ is the positive cone of TR M . Moreover, 

sup E fe \\T n - f e || 2 > c*K( nMjTk{X) (S,5) . (5.18) 
eeRf \{o} 

M(0)<S 
\6\x<5 

The proof of this theorem is given in Subsection 8.3 of the appendix. It is worth mentioning that 
the result of Theorem 5.3 is stronger than the minimax lower bounds discussed in Subsection 5.1 
(cf. (5.3)) in the sense that even if 77 = fe.,0* G IR M , where M{9*) and |0*| i are known a priori, 
the rate cannot be improved. 



Define R = l + 



for some constant Co > to be chosen small enough. We 



now show that for each choice of R > 1 such that R < M A n, there exists at least one matrix 
X G T>(R/2, k) such that R < rk(X) < R. A basic example is the following. Take the elements 
Xjj = fj(xi),i = 1, . . . , n,j = 1, . . . , M, of matrix X as 

f tij.fZi if i < R, 
Xi, j = { J V R ~ ' (5.19) 
[ otherwise , 

where < i <, 1 < j < M are i.i.d. Rademacher random variables, i.e., random variables 

taking values 1 and —1 with probability 1/2. First, it is clear that then ||/j|| < 1, j = 1, . . . , M. 
Next, condition (ii) in the definition of T>(R/2,k) follows from the results on RI properties 
of Rademacher matrices. Many such results have been derived and we focus only on that 
of Baraniuk et al. (2008) because of its simplicity. Indeed, Theorem 5.2 in Baraniuk et al. (2008) 
ensures not only that for an integer S' < M An there exist design matrices in D(S' /2, k) but 
also that most of the design matrices X with i.i.d. Rademacher entries ej j are in D(S' /2, k) for 
some k > as long as there exists a constant Co small enough such that the condition 

M^lo g (l + f)<C < 5 ' 2 °> 

is satisfied. Specifically, Theorem 5.2 in Baraniuk et al. (2008) ensures that if X' is the R x M 
matrix composed of the first R rows of X with elements as defined in (5.19), and 

|log(l + ^)<C (5.21) 
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holds for small enough Co, then 



k™\0$< £2!I< S ~|0|§, \/6:M(9)<R, 
R R R 

with probability close to 1 which in turn implies (ii) with t = R/2. As a result, the above 
construction yields X G D(R/2,n) that has rank bracketed by R and R since (5.21) holds by 
our definition of R. 

In what follows Co is the constant in (5.20) small enough to ensure that Theorem 5.2 
in Baraniuk et al. (2008) holds, and we assume w.l.o.g. that Co < 1. 

Using the above remarks and Theorem 5.3 we obtain the following result. 



Theorem 5.4 Fix 5 > and integers n>l,M>2,l<S< M, R > 1. Moreover, assume that 
1 + ^1 log(l + eM/R) < M An. Then there exists a dictionary H composed of functions fj with 
maxi<j<jv/ \\fj\\ < 1, R < rk(X) < 1 + ^r- log(l + eM/R), and a constant c* > such that 

inf sup sup{^||T n -?7|| 2 - ||f e -r?|| 2 } > c*C„,M,rk(x)('S',5). (5.22) 
T " 6»eRf \{0} V 

M(e)<s 

\8\i<5 

where the infimum is taken over all estimators. Moreover, 

inf sup E f0 \\T n - f e || 2 > c*Cn,M,rk(x) (S, 5) . (5.23) 

T " 8eRf\{0} 

M(e)<s 
\e\i<& 

Proof. Let X be a random matrix constructed as in (5.19) so that the rank of X is bracketed 
by R and R and X G D(R/2, k). We consider two cases. Assume first that S < R/2 so that 
X G D(R/2, k) C T>(S, k) C D(S Am, k) and the result follows trivially from Theorem 5.3. Next, 
if S > R/2, observe that 

, , R ( eM\ 2 ( 2eM\ 

rk(X)< J R<l + — log 1 + — <— i?log 1 + 



(we used here that Co < 1), so that 

rk(X) A Slog (l + e -^f \ < rk(X) < A f rk (X) A fllog (l + . 

It yields Cn,M,rk(x) (^j < CCn,M,rk(x)(-R/2, 5) and the result follows from Theorem 5.3, which 
ensures that 

inf sup sup {E v \\T n - n\\ 2 - ||f e - n\\ 2 } > c*KC niM ,rk(x)(-R/2, S) > c*Cn,M,rk(x) 5) . 
M(9)<S 

\e\i<6 
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As a consequence of Theorem 5.3 we get a lower bound on the Iq ball Bq(S) = {9 : M{9) < S} 
by formally setting 5 = oo in (5.17): 



a 2 



SUp SUp {£^||T n - 7/|| - \\f 9 -7/11 } > C*K 



V 6»eR 



m n 



rk(X) A ,S'],).n ( 1 + ^ 



(5.24) 



M(6)<S 



and the same type of bound derived from (5.18). Analogous considerations lead to the following 
lower bound on the l\ ball B\{$) = {6 : \9\% < 5} when setting S = M: 



rnllm ll2 llr / fJ 2 rk(X) a5 I ( eMa\ r2 \ 



\eu<5 



and to the same type of bound derived from (5.18). 

Consider now the linear regression, i.e., assume that there exists 6* such that r/ = f#*. 
Comparing (3.14) with (5.18) we find that for 5 > l/\/n the rate Cn,M,rk(x) ^) is the minimax 
rate of convergence on Bq(S) Pi B\(5) and that the ES estimator is rate optimal. Moreover, it is 
rate optimal separately on Bq(S) and B\ (6), and the minimax rates on these sets are given by 
the right hand sides of (5.24) and (5.25) respectively. 

For the diagonal model (cf. Subsection 5.1), asymptotic lower bounds and exact asymp- 
totics of the minimax risk on £ q balls were studied by Donoho et al. (1992) for q = and 
by Donoho and Johnstone (1994b) for < q < oo. These results were further refined by 
Abramovich et al. (2006). In the £q case, Donoho et al. (1992) exhibit a minimax rate over 
Bq(S) that is asymptotically equivalent to 



,5 , / n 
2a 2 - log - 
n \S 



as M = n — > oo . 



In the l\ case, Donoho and Johnstone (1994b) prove that the minimax rate over an i\ ball with 
radius 5 is asymptotically equivalent to 



5a I . ( cr v /n N . , r 

—j= \ 2 log I — ^ — ) asM = n^oo. 



n V V 5 



In both cases, the above rates are equivalent, up to a numerical constant, to the asymptotics of 
the right hand sides of (5.24) and (5.25) under the diagonal model. We note that the results 
of those papers are valid under some restrictions on asymptotical behavior of S (resp. 5) as a 
function of n. 

Recently Raskutti et al. (2009) extended the study of asymptotic lower bounds on t q balls 
(0 < q < 1) to the non-diagonal case with M ^ n. Their results hold under some restrictions 
on the joint asymptotic behavior of n, M and S (respectively, 5). The minimax rates on the £o 
and i\ balls obtained in Raskutti et al. (2009, Theorem 3) are similar to (5.24) and (5.25) but, 
because of the specific asymptotics, some effects are wiped out there. For example, the i\ rate 
in Raskutti et al. (2009) is 5y (log M)/n, whereas (5.25) reveals an elbow effect that translates 
into different rates for ark(X) < 5^fn. Furthermore, the dependence on the rank of X does 
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not appear in Raskutti et al. (2009), since under their assumptions rk(X) = n. Theorem 5.3 
above gives a stronger result since it is (i) non-asymptotic, (ii) it explicitly depends on the 
rank rk(X) of the design matrix and (hi) it holds on the intersection of the £q and l\ balls. 
Moreover, Theorem 3.2 shows that the £q — £\ lower bound is attained by one single estimator: 
the Exponential Screening estimator. Alternatively, Raskutti et al. (2009) treat the two cases 
separately, providing two lower bounds and two different estimators that attain them in some 
specific asymptotics. 

6 Universal aggregation 

Combining the elements of a dictionary % = {/x, . . . , Jm} to estimate a regression function rj 
originates from the problem of aggregation introduced by Nemirovski (2000). It can be generally 
described as follows. Given C JR. , the goal of aggregation is to construct an estimator f n 
that satisfies an oracle inequality of the form 

E||/ n - 77 || 2 <min||f - 77 || 2 + (7A ni M(e), C>0, (6.1) 

with the smallest possible (in a minimax sense) remainder term A„ m(®)i m which case A ni A/(0) 
is called optimal rate of aggregation, cf. Tsybakov (2003). Nemirovski (2000) identified three 
types of aggregation: (MS) for model selection, (C) for convex and (L) for linear. Bunea et al. 
(2007b) also considered another collection of aggregation problems, denoted by (Ld) for subset 
selection and indexed by D G {1, . . . ,M}. To each of these problems corresponds a given set 
C IR and an optimal remainder term A n( jy(0). For (MS) aggregation, = 0(ms) = 
-Bo(l) H -Bi(l) = {ei, . . . , cm}) where ej is the j-th vector of the canonical basis of 1R M . For (C) 
aggregation, = 0( C ) is a convex compact subset of the simplex B\(l) = {9 G 1R M : \9\\ < 1}. 
The main example of {f$,9 G 0(C)} is the set of all convex combinations the ffs. For (L) 
aggregation, = 0( L ) = IR A/ = Bq(M), so that {f#, 9 G 0(l)} is the set of all linear combinations 
the /j's. Given an integer D G {1, . . . , M}, for (Ld) aggregation, = 0(l d ) = Bq(D) = {9 € 
IR : M{9) < D}. For this problem, {fe,9 G 0(l d )} is the set of all linear combinations of at 
most D of the fj's. 

Note that all these sets are of the form Bq(S) n Bi(5) for specific values of S and 5. This 
allows us to apply the previous theory. 

Table 1 presents the four different choices for together with the optimal remainder terms 
given by Bunea et al. (2007b). For (MS), (C) and (L) aggregation they coincide with optimal 
rates of aggregation originally proved in Tsybakov (2003) for the regression model with i.i.d. 
random design and integral L2 norm in the risk. A fifth type of aggregation called the D- convex 
aggregation, which we denote by (Cd) was studied by Lounici (2007). In this case, = ©(c) is 
a convex compact subset of -Bi(l) H Bq(D), so that {f#, 9 G @(C D )} can be, as a typical example, 
the set of convex combinations of at most D of the /,'s. Lounici (2007) proves minimax lower 
bounds together with an upper bound that departs from the lower bound by logarithmic terms. 
However, the results hold in the i.i.d. random design setting and do not extend to our setup. 
While several papers use different estimators for different aggregation problems (see Tsybakov, 
2003; Rigollet, 2009), one contribution of Bunea et al. (2007b) was to show that the bic estimator 
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Problem 


e 


A n , A /(e) 


(MS) 
(C) 

(L) 


e (MS) = B (i)n5i(i) 

6(0=^(1) 

e (L) = b (m) 
e {Lo) = B (z>) 


log M 
n 


f A ^108(1 + $) 

M 
n 

#log(l + ^) 



Table 1: Sets of parameters 6(ms)> ©(C) j ®(L) an d ®(L D ) an d corresponding optimal rates of 
aggregation presented in Bunea et al. (2007b). Note that Bunea et al. (2007b) considered a 
slightly different definition in the (C) case: 6(c) = Bi(l) n leading to the same rate. 



defined in Section 4 satisfies oracle inequalities of the form 



E||f™ c - r]\\ 2 < (1 + a) min ||f e - r) 



| 2 + ci±^A n , M (e) 



(6.2) 



simultaneously for all the sets O presented in Table 1. Here a and C positive constants. More- 
over, for the Lasso estimator defined in (4.17), Bunea et al. (2007b) show less precise inequalities 
under the assumption the matrix X T X is positive definite, where X is the design matrix defined 
in Section 2. Note that these oracle inequalities are not sharp since the leading constant is 
1 + a and not 1, whereas letting a — > results in blowing up the remainder term. The following 
theorem shows that the Exponential Screening estimator satisfies sharp oracle inequalities (i.e., 
with leading constant 1) that hold simultaneously for the five problems of aggregation. 

Theorem 6.1 Assume that maxi<j<M \\fj\\ < 1- Then for any M > 2,n > 1,D < M, and 
G {©(ms)> ©(C); ©(L)j ©(L D )) ©(C D )} Exponential Screening estimator satisfies the following 
oracle inequality 



r?|| 2 < min llfo 
flee 



where C > is a numerical constant and 



A*m(©) 



A 



u 2 R » o- 2 log Al 
n n 

cr 2 R 
n 

a 2 R 
n 

u 2 R 



2l\ og (l + eM* 



^A^log(l + ^) 

a 2 R 



A 



^ log ( 1 + ^S- 



A 



2 D 



io g (i + ^) ife 



if e 


— ©(MS) 


if e 


= ©(C), 


if e 


= ©(L)> 


if e 


= 


if e 


= ®(C D ) 



The proof of Theorem 6.1 follows directly from (3.10) and (3.11). 

We also observe that A* M (Q) is to within a constant factor of A* M (Q) A 1 since 

<a 2 . 



a 2 R 



< 



a 2 (MAn) 
n 
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Using Theorems 5.3 and 5.4 it is not hard to show that the rates A* M (B) A 1 for A* M (G) 
listed in Theorem 6.1 are optimal rates of aggregation in the sense of Tsybakov (2003). Indeed, 
it means to prove that there exists a dictionary % satisfying the assumptions of Theorem 5.4, 
and a constant c > such that the following lower bound holds: 

inf sup \E v \\T n - 7]\\ 2 - min \\f e - ^f) > c(A* M (G) A 1) , (6.3) 
T n r) I dee J 

where the infimum is taken over all estimators. An important observation here is that the left 
hand side of (6.3) is greater than or equal to 

infsup£ f jT, t -f e || 2 . (6.4) 

It remains to note that a lower bound for (6.4) with the rate A* M {@) A 1 follows directly from 
Theorem 5.4 (cf. also (5.24) and (5.25)) applied with the values S and 5 corresponding to the 
definition of 0. 

Interestingly, the rates given in Theorem 6.1 are different from those in Table 1, and also 
from those for the regression model with i.i.d. random design established in Tsybakov (2003) 
and Lounici (2007). Indeed, they depend on the rank R of the regression matrix X, and the 
bounds are better when the rank is smaller. This is quite natural since the distance ||fs ES — v\\ 
is the "empirical distance" depending on X. One can easily understand it from the analogy 
with the behavior of the ordinary least squares estimator, cf. Lemma 3.1. Alternatively, the 
distance used in Tsybakov (2003) and Lounici (2007) for the i.i.d. random design setting is the 
-^2(-Px)-distance where Px is the marginal distribution of AVs, and no effects related to the rank 
can occur. As concerns Table 1, the optimality of the rates given there is proved in Bunea et al. 
(2007b) only for M < n and X T X/n equal to the identity matrix, in which case R = M and 
thus the effect of R is not visible. 



7 Implementation and numerical illustration 

In this section, we propose an implementation of the ES estimator together with a numerical 
experiment both on artificial and real data. We suppose throughout that the sample is fixed, so 
that the least squares estimators 8 P , p £ V, are fixed vectors. 

7.1 Implementation via Metropolis approximation 

Recall that the ES estimator 9 ES is the following mixture of least squares estimators: 



1 n II 



4(J 2 Z^- A™ 2 

5es . = tl 

1 n II 

£ ex p(-i^£^-V^)) 2 -T 

pev i=i 



TTp 



(7.1) 



where V := {0, 1} A/ , it is the prior (3.4), and 9 p is the least squares estimator on IR P . 
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Recall also that the prior it defined in (3.4) assigns weight 1/2 to the ordinary least squares 
estimator 0\, where 1 = (1, . . . , 1) S V . It is not hard to check from the proof of Theorem 3.1 
that it allows us to cap the rates by a 2 R/n. While this upper bound has important theoretical 
consequences, in the examples that we consider in this section, we typically have R = n so that 
the dependence of the rates in it! is inconsequential. As a result, in the rest of the Section, we 
consider the following, simpler prior 

f / , , \ IdI 

if |p| < R, 




(7.2) 

otherwise . 

Exact computation of fg ES requires the computation of 2 R ~ l least squares estimators. In many 
applications this number is prohibitively large and we need to resort to a numerical approxi- 
mation. Notice that 9 ES is obtained as the expectation of the random variable Op where P is a 
random variable taking values in V with probability mass function v given by 

1 71 I Pi 

i=i 

This Gibbs-type distribution can be expressed as the stationary distribution of the Markov 
chain generated by the Metropolis-Hastings (MH) algorithm (see, e.g., Robert and Casella, 2004, 
Section 7.3). We now describe the MH algorithm employed here. Consider the M-hypercube 
graph Q with vertices given by V. For any p € ?, define the instrumental distribution g(-|p) 
as the uniform distribution on the neighbors of p in Q and notice that since each vertex has 
the same number of neighbors, we have ?(p|q) = f7(q | p) for any p, q £ V. The MH algorithm 
is defined in Figure 7.1. We use here the uniform instrumental distribution for the sake of 
simplicity. Our simulations show that it yields satisfactory results both in performance and in 
the speed. Another choice of q(-\-) can potentially further accelerate the convergence of the MH 
algorithm. 

The following theorem ensures the ergodicity of the Markov chain generated by the MH 
algorithm. 

Theorem 7.1 For any function p i— >• 6 p G JR M , the Markov chain (Pt)t>o defined by the MH 
algorithm satisfies 

1 To+T 

t=T +i P ev 
where Tq > is an arbitrary integer. 

Proof. The chain is clearly ^-irreducible, so the result follows from Robert and Casella (2004, 
Theorem 7.4, p. 274). | 



In view of this result, we approximate 6 ES = ^2 p& -p OpVp by 



T +T 



9f? 



t=T +l 
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Fix p = € TR M . For any t > 0, given p t € 7>, 

1. Generate a random variable Qt with distribution q(-\pt) 

2. Generate a random variable 



t+i 



Qt with probability r(pt,Qt) 
p t with probability 1 — r(p t , Qt) 



where 



r(p,q) = min l^j 
3. Compute the least squares estimator 9p t+1 - 



Figure 1: The Metropolis-Hastings algorithm on the M-hypercube. 

which is close to 8 ES for sufficiently large T. One salient feature of the MH algorithm is that it 
involves only the ratios fq/V p where p and q are two neighbors in Q. Since 

v = exp t [(* - - w - %&)) 2 } + M T M ) I , 

p 1=1 p 

the MH algorithm benefits from the choice (7.2) of the prior fr in terms of speed. Indeed, for 
this prior, we have 

n p \ + \p\) \2eM) ' 
and to = |q| — |p| G {—1, 1} when p and q are two neighbors in Q. In this respect, the choice of the 
prior fr as in (7.2) is better than the suggestions in Leung and Barron (2006) and Giraud (2008) 
who consider priors that require the computation of the combinatoric quantity (j^j) . Moreover, 
the choice (7.2) yields slightly better constants and improves the remainder terms in the oracle 
inequalities of Section 3, as compared to what would be obtained with those priors. 

As a result, the MH algorithm in this case takes the form of a stochastic greedy algorithm 
with averaging, which measures a tradeoff between sparsity and prediction to decide whether to 
add or remove a variable. In all subsequent examples, we use a pure MATLAB implementation 
of the ES estimator. While the benchmark estimators considered below employ a C based code 
optimized for speed, we observed that a safe implementation of the MH algorithm (three time 
more iterations than needed) exhibited an increase of computation time of at most a factor two. 



7.2 Numerical experiments 
7.2.1 Sparse recovery 

While our results for the ES estimator hold under no assumption on the dictionary, we first 
compare the behavior of our algorithm in a well-known example where the L-conditions on the 
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dictionary are satisfied and therefore sparse recovery by ^i-penalized techniques is theoretically 
achievable. 

Consider the model Y = X0*+<r^, where X is an nxM matrix with independent Rademacher 
or standard Gaussian entries and £ 6 ]R n is a vector of independent standard Gaussian random 
variables and is independent of X. The vector 9* is given by 6j = I(j < S) for some fixed S 
so that M(6*) = S. The variance is chosen as a 2 = S/9 following the numerical experiments of 
Candes and Tao (2007, Section 4). For different values of (n, M, S), we run the ES algorithm on 
500 replications of the problem and compare our results with several other popular estimators 
in the sparse recovery literature. We limit our choice to estimators that are readily implemented 
in R or MATLAB. The considered estimators are: 

1. The Lasso estimator with regularization parameter cry / 8(log M)/n as indicated in Bickel et al. 
(2009), 

2. The cross-validated Lasso estimator (LassoCV) with regularization parameter obtained by 
ten-fold cross-validation, 

3. The Lasso-Gauss estimator (Lasso-G) corresponding to the Lasso estimator computed in 
1., and threshold value given by <Ty/(2 log M)/n, 

4. The cross-validated Lasso-Gauss estimator (LassoCV-G) corresponding to the Lasso esti- 
mator computed in 2., and threshold value given by a^J (2 log M) Jn, 

5. The MC+ estimator of Zhang (2010) with regularization parameter o\J (2 log M)/n, 

6. The SCAD estimator of Fan and Li (2001) with regularization parameter a^J (2 log M)/n. 

The Lasso-Gauss estimators in 3. and 4. are obtained using the following two-step procedure. 
In the first step, a Lasso estimator (Lasso or LassoCV) is computed and only coordinates larger 
than the threshold cr y / 2(log M)/n are retained in a set J . In the second step, the Lasso- 
Gauss estimators are obtained by constrained least squares under the constraint that coordinates 
$j J are equal to 0. Indeed, it is usually observed that the Lasso estimator induces a 
strong bias by over-shrinking large coefficients and the Lasso-Gauss procedure is a practically 
efficient remedy to this issue. By construction, the SCAD and MC+ estimators should not suffer 
from such a shrinkage. The Lasso estimators are based on the 11-ls package in MATLAB 
(Koh et al., 2008). The MC+ and SCAD estimators are implemented in the plus package in R 
(Zhang and Melnik, 2009). 

The performance of each of the seven estimators generically denoted by is measured by 
its prediction error |X(# — 6*)\2/n = ||f^ — f#*|| 2 . Moreover, even though the estimation error 
\6 — 6*\ 2 is not studied above, we also report its values in Table 3, for a better comparison with 
other simulation studies. We considered the cases (n,M,S) G {(100, 200, 10), (200, 500, 20)}. 

The Metropolis approximation was computed with To = 3, 000, T = 7, 000, which should be 
in the asymptotic regime of the Markov chain since Figure 3 shows that on a typical example, 
the right sparsity pattern is recovered after about 2,000 iterations. 

Figure 2 displays comparative boxplots for both Gaussian and Rademacher design matrix. 
In particular, it shows that ES outperforms all six other estimators and has less variability across 
repetitions. 
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Figure 2: Boxplots of |X(6> — ^*)||/ri over 500 realizations for the es, Lasso, cross- validated Lasso 
(LassoCV), Lasso-Gauss (Lasso-G), cross- validated Lasso-Gauss (LassoCV-G), MC+ and SCAD 
estimators. Left: (n,M,S) = (100,200,10), right: (n,M,S) = (200,500,20), top: Gaussian 
design, bottom: Rademacher design. 



Figure 3 illustrates a typical behavior of the ES estimator for one particular realization of 
X and £. For better visibility, both displays represent only the 50 first coordinates of 05? , with 
T = 7, 000 and To = 3, 000. The left hand side display shows that the sparsity pattern is well 
recovered and the estimated values are close to one. The right hand side display illustrates the 
evolution of the intermediate parameter 9p t for t = 1,...,5000. It is clear that the Markov 
chain that runs on the M-hypercube graph gets trapped in the vertex that corresponds to the 
sparsity pattern of 9* after only 2, 000 iterations. As a result, while the ES estimator is not sparse 
itself, the MH approximation to the ES estimator may output a sparse solution. A covariate 
Xj is considered to be selected by an estimator 6, if \9j\ > 1/n. Hence, for any two vectors 

5 0(2) g jpA-f define 9^ A 9^ € {0, 1} M as the binary vector with j-th coordinate given by 

(0W A 9^)j = l{\9f\ > l/n,9f = 0) + l(9f = 0, \9f\ > 1/n) . 
The performance of an estimator 9 in terms of model selection is measured by the number 
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(M,n,b) 


ES 


Lasso 


EassoU V 


Easso-Li 


EassoL- 


MC+ 


SCAD 


(100,200,10) 


0.12 


1.47 


0.99 


0.75 


0.35 


0.41 


0.86 




(0.07) 


(0.31) 


(0.40) 


(0.77) 


(0.53) 


(0.20) 


(0.40) 


(200, 500, 20) 


0.24 


3.39 


1 01 


2.55 


0.70 


1 C\'~7 

1.07 


2.37 




(0.10) 


(0.50) 


(0.50) 


(E45) 


(0.76) 


(0.35) 


(0.64) 


(M,n,S) 


ES 


Lasso 


LassoCV 


Lasso-G 


LassoCV-G 


MC+ 


SCAD 


(100,200,10) 


0.12 


1.48 


0.99 


0.70 


0.30 


0.39 


0.83 




(0.06) 


(0.31) 


(0.38) 


(0.79) 


(0.47) 


(0.19) 


(0.39) 


(200,500,20) 


0.24 


3.32 


1.76 


2.34 


0.66 


1.05 


2.37 




(0.09) 


(0.49) 


(0.49) 


(E44) 


(0.74) 


(0.33) 


(0.61) 



Table 2: Means and standard deviations of |X(0 — 9*)fy/n over 500 realizations for the ES, 
Lasso, cross- validated Lasso (LassoCV), Lasso-Gauss (Lasso-G), cross- validated Lasso-Gauss 
(LassoCV-G), MC+ and SCAD estimators. Top: Gaussian design, bottom: Rademacher design. 



M{9 A 9*) of variables that are incorrectly selected or incorrectly left out of the model. Among 
the four procedures considered here, MC+ uniformly dominates the other three in terms of 
model selection. Table 4 displays the relative average model selection error (RAMS) over 500 
repetitions of each of the experiments described above: 



Ef=i M(gW A 9*) 

J2 5 i=i m(^»' mc + a e* 



RAMS(6>) = -^=±-± >— , (7.3) 



where for each repetition i of the experiment, #W> MC + denotes the MC+ estimator and 9^ is one 
of the four estimators: ES, Lasso, MC+ or SCAD. 

While MC+ uniformly dominates the three other procedures, the model selection properties 
of ES are better than Lasso but not as good as SCAD and the relative performance of ES improves 
when the problem size increases. The superiority of MC+ and SCAD does not come as a surprise 
as these procedures are designed for variable selection. However, ES makes up for this deficiency 
by having much better estimation and prediction properties. 

To conclude this numerical experiment in the linear regression model, notice that we used 
the knowledge of the variance parameter a 2 to construct the estimators, except for those based 
on cross-validation. In particular, ES depends on a 2 and it necessary to be able to implement 
it without such a knowledge. While an obvious solution consists in resorting to cross-validation 
or bootstrap, such procedures tend to become computationally burdensome. We propose the 
following estimator for a 2 . Let 9 ES denote the estimator obtained by replacing a 2 with any upper 
bound a 2 > a 2 in the definition (7.1) of the ES estimator. Define 

|Y -X9 ES (s 2 )\ 2 o 



a 2 = inf s 2 



n- M n (9 ES (s 2 )) 



s 2 



> a 



where a > is a tolerance parameter and for any 9 6 IR M , M n (9) = Y2j=i ^(l^il > V n )- ^ s 
a result, the proposed estimator a 2 is the smallest positive value that departs from the usual 
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(M,n,b) 


ES 


Lasso 


EassoU V 


Easso-Li 


EassoL- 


MC+ 


SCAD 


(100,200,10) 


0.14 


2.06 


1.42 


1.08 


0.48 


0.56 


1.30 




(0.12) 


(0.72) 


(0.66) 


(1.22) 


(0.84) 


(0.34) 


(0.81) 


(zUU, o(JU, zUJ 


U. J,l 


A TO 

4. 11 


2. Id 


3.62 


n no 
(J.yo 


1 A K 

1.45 


3.51 




(0.13) 


(1.24) 


(0.88) 


(2.29) 


(1.13) 


(0.63) 


(1.33) 


(M,n,S) 


ES 


Lasso 


LassoCV 


Lasso-G 


LassoCV-G 


MC+ 


SCAD 


(100,200,10) 


0.13 


1.99 


1.37 


0.94 


0.38 


0.51 


1.21 




(0.07) 


(0.71) 


(0.60) 


(1.19) 


(0.68) 


(0.35) 


(0.81) 


(200,500,20) 


0.26 


4.50 


2.60 


3.20 


0.82 


1.38 


3.44 




(0.11) 


(1.14) 


(0.80) 


(2.20) 


(1.00) 


(0.56) 


(1.22) 



Table 3: Means and standard deviations of \6 — 6*%. over 500 realizations for the ES, Lasso, cross- 
validated Lasso (LassoCV), Lasso-Gauss (Lasso-G), cross- validated Lasso-Gauss (LassoCV-G), 
MC+ and SCAD estimators. Top: Gaussian design, bottom: Rademacher design. 



Design (M, n, S) 


ES 


Lasso 


MC+ 


SCAD 


Gauss. (100,200,10) 


10.54 


12.43 


1.00 


3.56 


Gauss. (200,500,20) 


9.26 


15.81 


1.00 


6.04 


Rad. (100,200,10) 


13.18 


15.80 


1.00 


3.59 


Rad. (200,500,20) 


10.07 


16.18 


1.00 


6.18 



Table 4: Relative average model selection error (RAMS) denned in (7.3) over 500 realizations for 
the ES, Lasso, MC+ and SCAD estimators. Top: Gaussian design, bottom: Rademacher design. 

estimator for the variance by more than a. The motivation for this estimator comes from the 
following heuristics, which is loosely inspired by Zhang (2010, Section 5.2). It follows from the 
results of Leung and Barron (2006) that 6 ES (a 2 ) satisfies the oracle inequalities of Section 3 and 
thus of Section 6 with a 2 replaced by a 2 . As a consequence, we can use any upper bound 
a 2 > a 2 to compute an estimator 6 ES (a 2 ) and thus, an estimator of the variance based on the 
residuals. Our heuristics consists in choosing the smallest upper bound that is inconsistent with 
the estimator based on the residuals. Figure 4 and Table 5 summarize the performance of the 
variance estimator a 2 and the corresponding ES estimator ^ ES (<7 2 ) for a = 1. 

Notice that in Table 5, the obtained values are comparable to those in Tables 2 and 3. 
It is worth noticing that the experiment with Gaussian design and (M, n, S) = (200, 500, 20) 
suffers from a long tail of relatively poor performance (30 realizations out of 500 are outliers) 
that deteriorates both the average performance and its standard deviation. Nevertheless, it is 
remarkable that the ES estimator with such estimator of the variance still has smaller prediction 
and estimation errors in these experiments than the other six considered methods. 
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Figure 3: Typical realization for (M, n, S) = (500, 200, 20) and Gaussian design. Left: Value 

of the 0fF, T = 7, 000, T = 3, 000. Right: Value of Pt for t = 1, . . . , 5000. Only the first 50 
coordinates are shown for each vector. 

7.2.2 Handwritten digits dataset 

The aim of this subsection is to illustrate the performance of the ES algorithm on a real dataset 
and to compare it with the state-of-the-art procedure in sparse estimation, namely the Lasso. 
While sparse estimation is the object of many recent statistical studies, it is still hard to find a 
freely available benchmark dataset where M 3> n. We propose the following real dataset orig- 
inally introduced in LeCun et al. (1990) and, in the particular instance of this paper, obtained 
from the webpage of the book by Hastie et al. (2001). We observe a grayscale image of size 
16 x 16 pixels of the handwritten digit "6" (see Figure 6) which is artificially corrupted by a 
Gaussian noise. Formally, we can write 

Y = n + a£ , (7.4) 

where Y G 1R 256 is the observed image, fi E [0, l] 256 is the true image, a > and £ G IR 256 is a 
standard Gaussian vector. Therefore the number of observations is equal to the number of pixels: 
n = 256. The goal is to reconstruct fj, using linear combinations of vectors x\, . . . , xm G [0, l] 256 
that form a dictionary of size M = 7, 290. Each vector Xj is a 16 x 16 grayscale image of a 
handwritten digit from to 9. As a result, strongly correlated as illustrated by the 

correlation matrix displayed in Figure 5. The digit "6" is a notably hard instance due to its 
similarity with the digits "0" and with some instances of the digit "5" (See Figure 9). Given an 
estimator 9, the performance is measured by the prediction error — X#| 2 , where X is the n x M 
design matrix formed by horizontal concatenation of the column vectors x±, . . . ,xm G IR™- 

Figures 6 and 7 illustrate the reconstruction of this digit by the ES, Lasso and Lasso-Gauss 
estimators for a = 0.5 and a = 1 respectively. The latter two estimators were computed with 
fixed regularization parameter equal to a ^8 (log M)/n and the threshold for the Lasso-Gauss 
estimator was taken equal to a^/2(\og M)/n. It is clear from those figures that the Lasso 
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0.5 

Gauss. (100, 200, 10) Rad. (100, 200, 10) 



0.5 

Gauss. (200, 500, 20) Rad. (200, 500, 20) 



Figure 4: Boxplots of the estimated variance a 2 based on 500 replications of each of the four ex- 
periments described above. The horizontal dashed lines indicate the value of the true parameter 
a 2 = 5/9. Left: a 2 = 1.11. Right: a 2 = 2.22. 

estimator reconstructs the noisy image and not the true one indicating that the regularization 
parameter <Ty / 8(log M)/n may be too small for this problem. 

For both a = 0.5 and a = 1, the experiment was repeated 250 times and the predictive 
performance of ES was compared with that of the Lasso and Lasso-Gauss estimators. The 
results are represented in Figure 8 and Table 6. 

To conclude, we mention a byproduct of this simulation study. The coefficients of 9 ES can 
be used to perform multi-class classification following the idea of Wright et al. (2009). The 
procedure consists in performing a majority vote on the features Xj that are positively weighted 
by 8 ES , i.e., such that 9 ES > 0. For the particular instance illustrated in Figure 6 (c), we see in 
Figure 9 that only a few features Xj receive a large positive weight and that a majority of those 
correspond to the digit " 6" . 
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Design (M, n, S) 


|X(# ES (<7 2 ) -0*)\l/n 


\e™(d 2 )-e*\l 


Gauss. (100,200,10) 


0.12 


0.14 




(0.09) 


(0.14) 


Gauss. (200,500,20) 


0.26 


0.31 




(0.19) 


(0.32) 


Rad. (100,200,10) 


0.12 


0.13 




(0.07) 


(0.08) 


Rad. (200,500,20) 


0.25 


0.28 




(0.11) 


(0.14) 



Table 5: Means and standard deviations of prediction error |X(# ES (<r 2 ) — #*)| 2 ,M and estimation 
error \9 ES (a 2 ) — 9*\\ over 500 realizations for the ES estimator 6 ES (a 2 ) with estimated variance. 





ES 


Lasso 


Lasso-Gauss 


a = 0.5 


26.57 


59.49 


40.55 




(4.57) 


(5.28) 


(14.58) 


a = 1.0 


51.70 


239.39 


82.95 




(12.32) 


(22.12) 


(24.40) 



Table 6: Means and standard deviations for \[x — over 250 realizations of the ES, Lasso and 
Lasso-Gauss estimators to reconstruct the digit "6" . 




Correlation coefficient 

Figure 5: Left: Histogram of the M(M — l)/2 correlation coefficients between different images 
in the database. Right: The upper left corner of size 200 x 200 of the full correlation matrix. 
Notice that only the absolute value of the correlation coefficients is discriminative in terms of 
color. The dark, off-diagonal regions are characteristic of correlated features. 
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6 3* 6 & & 



(a) True (b) Noisy (c) ES (d) Lasso (e) Lasso-Gauss 

Figure 6: Reconstruction of the digit "6" with a = 0.5. 



61616 

(a) True (b) Noisy (c) ES (d) Lasso (e) Lasso-Gauss 

Figure 7: Reconstruction of the digit "6" with a = 1.0. 



70 - 
60 - 



Figure 8: Boxplots of the predictive performance |/i — X0|| of the ES, Lasso and Lasso-Gauss 
(Lasso-G) estimators computed from 250 replications of the model (7.4) with /i corresponding 
to the digit "6". Left: a = 0.5. Right: a = 1. Notice that each graph uses a different scale. 
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Figure 9: Coefficients of #f?, T = 10,000 and the corresponding image. 
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8 Appendix 



8.1 Lemmas for the upper bound 

The following lemma is obtained by a variant of the "Maurey argument", cf. also Barron (1993); 
Bunea et al. (2007b); Bickel et al. (2008) for similar but somewhat different results. 

Lemma 8.1 For any 9* £ JR M \ {0}, any integer k > 1, and any function f we have 

min , ||/ -f f <||/-f<H| 2 + 



J0| 1= |0*li " mm(k,M(9*)) ' 

M{8)<k 



Proof. Fix 9* £ TR M \ {0} and an integer k > 1. Set K = mm(k,M(9*)). Consider the 
multinomial parameter p = (pi, . . . ,pm ) T € [0, l] M with pj = \9*\/\9*\i, j = 1,...,M, where 
9* are the components of 9*. Let k = (k\, . . . , km) T £ {0, 1, ... , M} M be the random vector 

with multinomial distribution A4(K,p), i.e., let Kj = XL=i ^C^ 5 = i) wnere ^i>--- i^K are i-i-d. 
random variables taking value j G {1, . . . , M} with probability pj, j = 1, . . . , M. In particular, 
we have £(Kj) = Kpj,j = l,...,M, where £ denotes the expectation with respect to the 
multinomial distribution. As a result, for the random vector 9 £ TR M with the components 9j = 
Kj sign(#*) \9*\i/K we have £(9j) = 0| for j = 1, . . . , M with the convention that sign(0) = 0. 
Moreover, using the fact that Var (ft,) = Kpj(l — Pj) and Cov(nj, ki) = —npjpi for j ^ I (see, 
e.g., Bickel and Doksum, 2006, eq. (A. 13. 15), p. 462) we find that the covariance matrix of 9 is 
given by 



£ 



_ e*) T ] = Qdiag(|0;|) - ^\9*\\9*\ T 



where \9*\ = (\9*\, . . . ,\9j\) T . Using a bias- variance decomposition together with the assumption 
maxj ||/j|| < 1, it yields that, for any function /, 

s\\s - y 2 = ii/-^ii 2 + -E F(x t ) T z*F( Xt ) < ||/ - v f + Ui , 

i=l 

where F(x{) = (fi(xi), . . . , /m(^j)) T ! i = l,...,n. Moreover, since 9 is such that \9\\ = \9*\i 
and M{9) < K, the lemma follows. ■ 



Lemma 8.2 FixM,n > 1 and assume that max j \\fj\\ < 1. For any function r\ and any constant 
v > we have 



min I \\f e - r?F + v a M ^ l og I 1 + 
6»eR M I n V 



eM 



M{6) V 



- U < min {\\f e - V \\ 2 + c<p nM (e)} (8.1) 
1/ J (9eR M 



w/iere c = (3 + j), (/?„ 5 m(0) = and for 9^0, 

<Pn,M(&) = < 



mm 



log 1 + 



,1*1! 



log 1 



_eMi^ N \ _l ^ 2 log(l+eM) 
|f|lV^ 



if <M < ||f e || 2 , 



otherwise . 



1.2) 
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Proof. Set 



A=min fe - 77 2 + i/ 2 ^^ log 1 + 



It suffices to consider IR M \ {0} instead of M M since A < ||f - r/|| 2 + ap n>M (0) = \\r]\\ 2 - Fix 
0* G K M \ {0} and define 



where £ = Jl og (l + ^=). 



Assume first x* < 1. In this case we have 



The previous display yields that (p n ,M(0*) = |#*| 2 - Moreover, if (fe*,r?) < ||fe*|| 2 , it holds 

llr/f^llf^-ryf + llf^f^ll^-^f + rif. 

As a result, 

^ < ||t?|| 2 < (pnM e *) if (fe*,v) < ||ffl«|| 2 and x* < I. (8.4) 

Set fe* = [x*] , i.e., fe* is the minimal integer greater than or equal to x* . Using the monotonicity 
of the mapping t4 £ log (l + ^p) for t > 0, and Lemma 8.1 we get, for any 0* € IR A/ \ {0} 
such that fe* < M(6*) , 

A < mta /|f < -„|> + ^ta ! fl + eM 



eeR" { n V M (°) v 1 



2 2 & t„ eM^ 



< min min < life — nil + ^ — log 1 H — — 

l<k<M{6*)0:M(6)<k y n \ k I 



< 



2 + mm \v 2 - log 1 + — + J— !i ^ 

i<fe<M(9*) [ n \ k J fe J 



ll2 2 fe *, / eM\ |#*| 2 
< fa* -r> 2 + v 2 — log 1 + — + 1 l! 



n \ fe* / fe* 
On the other hand, if 9* e IR A/ \ {0} and fe* > M(0*), we use the simple bound 

^4 < \\fg* -r)f +v 2 log [1 + 



n & V M ( 6 '*) v 1 
< ||f^ - r7[| 2 + z, 2 — log (^1 + — j . 



In view of the last two displays, to conclude the proof it suffices to show that 

a* 1 2 



2 fe* / eM\ \e*\i 



< c$nM B *) (8-5) 
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for all 0* g JR M \ {0}. Note first that if x* < 1, then k* = 1 and 



2 fc* / eM\ 10*1? ^ 2 log(l + eM) 10 

v 2 — log ( 1 + — + i-Ji < ^ ^ + !/L 

k* J k* n 



n 



n 



'log 1 + 



< 



where we used (8.3) in the first inequality. Together with (8.4), this proves that A < \\{q* — 
r/|| 2 + cif n ,M(0*) for all 0* € TR M \ {0} such that x* < 1. Thus, to complete the proof of the 
lemma we only need to consider the case x* > 1. For x* > 1 we have 



p „,„(0.)>,LJ^log 1 + _ 



eMi/ 



As a result, we have 



3*|2 



?.6) 



Moreover, it holds k* < 2x* = 2\Q*\\^fnj t and since the function 1 1-)- ^ log (l + ^) is increas- 
ing, we obtain 



A:* 



n 



eM\ 2\9*\ x 



— log 1 + — < 



k* J ~ ly/E 



log 1 + 



2|6»*|iVn 



Thus, for £ < v we have 



/c* 



, / eM \ z\u i] , / , 
— log 1 + — < 4-^ log 1 + 



210*1 



eMi/ 
2|0*|iV^ 



< „ - < — ¥?n,M(0 ) ■ 



For £ > v we use the inequality log(l + a&) < log(l + a) + log 6, V a > 0, 6 > 1, to obtain 



A;* / eM\ 2|0*h 



log ( 1 + ^-^ — ^ ) + log 



< 



2\6*\iy/n 
2|0*|i / £ Iog(£/i/) 



n + 



< 2 + - 



el i/Vi 



< (2+^ \cf>nM e *)- 



e v 



Thus, in both cases £ log (l + ^) < (2 + l/e>~ 2 i^ ni A/(0*). Combining this with (8.6) we 
get (8.5). " I 



8.2 Proof of Theorem 5.2 

Applying the randomization scheme described in Nemirovski (2000), p. 211, we create from the 
sample yi,...,y n satisfying (5.2) two independent subsamples with "equivalent" sizes [~n(l — 
1/ log log n)] and n — |~n(l — 1/ log log n)]. We use the first subsample to construct the ES 
estimator and the soft thresholding estimator SOPT j the latter attaining asymptotically the rate 
4^{0) for all 6 ]R n . We then use the second subsample to aggregate them, for example, as 
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described in Nemirovski (2000). Then the aggregated estimator denoted by 8* satisfies, for all 
9 G R n , 

E e \e*-e\\ < mm\E e \e S0FT - e\l E e \e™ - e\ 2 2 \ + cl °g lo g ra 



n 

< mm(CW,CW)(l + ^)) + C ' l0gl ° S " 



n 

where C > is an absolute constant and o(l) — > as n — > oo uniformly in £ H n . Set 

Ol/m , Clog log n 



^(^) = min(CW,CW) + 



Then (5.10) follows immediately. Next, 1 (9) > 2 (log n)/n, so that for all 6 K n , 
^(0) > ^{9) > 2(Iogn)/n 



V^n(6») ^C^) + C(loglogn)/n 2(logn)/n + C(loglogn)/n ' 
which implies (5.11). Finally, to prove (5.12) it is enough to notice that since 

tpn 1 ^) > 2(Iog n)/n, 
MO) <(fl) + C(loglogn)/n ^(g) Clog log n 

and to use (5.8). 

8.3 Proof of Theorem 5.3 

Clearly (5.17) follows from (5.18) since in the latter r] is fixed and equal to one particular function 
V = h- 

We now prove (5.18). Let H = {/i, . . . ,/m} be any dictionary in V(S A fh,K) with the 
corresponding k and R such that k/k = k. For any /c G {1, . . . ,M}, let fifc be the subset of 
7? = {0, 1} M defined by 

:={pGP : |p| = k}. (8.7) 

We consider the class of functions 



where < r < 1 will be chosen later. Note that functions in Fk{5) are of the form fg with 
9 G Mf \ {0}, M(Q) = k and \0\ x = t5 < 5. Thus, to prove (5.18), it is sufficient to show that, 
for any estimator T n , 

swpE v \\T n - ry|| 2 > c*K( n)M Mx){S,6) , (8.8) 

for some subset Q C F§(5) where S = [S A m A (M/2)] and [•] denotes the integer part. Note 
that S > 1 since M > 2 and S A m > 1. 

In what follows we will use the fact that for f,g€ J-§ (S) the difference / — g is of the form 
f# with some 9 G T^S' so that in view of (5.14), ||/ — g\\ 2 is bracketed by the multiples of |0|| 
with this value of 9. 
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We now consider three cases, depending on the value of the integer m defined in (5.16). 
Case (1): m = 0. Use Lemma 8.3 to construct a subset Qn\ C Ti(5) C ^§(5) with cardinality 
> (1 + eM) Cl and such that 

11/ -g\?> T -^, V/ l9 eg (1) ,/^. (8.9) 

Since m = 0, inequality (5.16) is violated for m = 1, so that 

2 2 

5 2 < — log (1 + eM) < log(s ( i)) • (8-10) 

Case (2): m > 1,5 A (M/2) > m. Then m = m = 5 and m < M/2, so that we have 
min(m, M — m) = m, and Lemma 8.3 guarantees that there exists Q^) Q J~m(5) = J~g(5) with 

5(2) 



cardinality s/ 2 ) > (1 + eM/m) c ' im and such that 



T 2 S 2 K 



To bound from below the quantity 5 2 /m, observe that from the definition of m we have 
5 2 5a 



> 

m \/n 



I ( eM\ 5a I ( eMa\ 



The previous two displays yield 



T 2 n5a L ( _ eMa 



»'- 9 » z ^fH 1+ ^)' < 8 - 12 > 

5yfn 



Note that in this case 



m + 1 > 



eM 



<7Wl0g(l+ w+| 

so that 

<5 2 2<5 2 , s <t 2 / eM \ a 2 , / eM\ 4a 2 , , , , . 

- < — < 2 m + 1 — log 1 + — < 4m— log 1 + < — log(s (2) ) • (8.13) 

m m + 1 n \ m + 1/ n \ my nL\ 

Case (3): m > 1, 5 A (M/2) < m. Then 5 = [5A(M/2)] < m. Moreover, we have min(5, M— 
5) = 5 and using Lemma 8.3, for any positive 5 < 5 we can construct C J- §(6) with cardi- 
nality S( 3 ) > (l + eM/S) ClS and such that 

||/-5|| 2 >^, V/^G^/^S, 

Take 



, Sir eM\ ^ m L A eM\ 
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where, in the last inequality, we used the definition of m. Next, note that S = [S7\(M/2)]1 > S/4 
since M > 2. Then 

ll2 rWS, / eM\ tWS, ( 4eM\ , d . 

In addition, we have 

P ^a 2 . / eM\ a 2 



5=^ l0 H 1+ ^J-^ l0g(S(3)) - (8 - 15) 

Since the random variables = 1, . . . ,n are i.i.d. Gaussian J\f(0,a 2 ), for any /, g € C/y), 
j G {1,2,3}, the Kullback-Leibler divergence JC(Pf,P g ) between Py and P 9 is given by 

n „ nT 2 5j--,R 

«^) = ^ll/- S ll 2 <^. 

where = 5( 2 ) = £,£(3) = ^^(i) = 1,^(2) = m >^(3) = Using respectively (8.10) in case (1), 
(8.13) in case (2) and (8.15) in case (3), and choosing r 2 = min(Ci/(32K), 1) (note that we need 
T < 1 by construction) we obtain 

K(P f ,P g )< ^ log ay) <^^, Vf,geQ {j ), J = 1,2,3. (8.16) 

Combining (8.9), (8.12) and (8.14) together with (8.16), we find that the conditions of Theo- 
rem 2.7 in Tsybakov (2009) are satisfied and use it to obtain (8.8). ■ 



8.4 A lemma for minimax lower bound 

Here we give a result related to subset extraction, which is a generalization of the Varshamov- 
Gilbert lemma used to prove minimax lower bounds (see, e.g., a recent survey in Tsybakov 
(2009)[Chap. 2]). For any M > 1, k € {1, . . . , M - 1}, let Of be the subset of {0, 1} M defined 
by: 

M 

u e {o,i} M -.J2 U} j = k 

3=1 



The next lemma is a modification of Birge and Massart (2001, Lemma 4). The difference is 
that we cover any M > 2, 1 < k < M. The result of Birge and Massart (2001) is proved for 
even integer k such that M > 3k > 6. The price we pay for considering general M, k is only in 
terms of constants, which is sufficient for our purposes. 

Lemma 8.3 Let M > 2 and 1 < k < M be two integers and define k = min(fc, M — k). Then 



there exists a subset f2 of such that the Hamming distance p(u>,u>') = Y2j=i -"-(^j ^ w j 
satisfies 

p(co,uj')> — y — , Vw,u'6n: uj ^ uJ , 
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and s = card(fi) satisfies 

log(s) > dfclog f 1 H- — ) , 



/or some numerical constant C\ > 9 • 10 . 



Proof, (i) Consider first the case where k = 2p for some integer p > 1 and M > 6p. Lemma 4 
in Birge and Massart (2001) ensures the existence of a subset Cl^ of Cl¥ such that p{uj,uj') > 
k/2 + 1 > (fc + 1) /2 for any w 7^ w' G and 

log (card^ 1 ))) > p [log(M/p) - log(16) + 1] = | log f . (8.17) 

(ii) Next, if k = 2p + 1 for some integer p > 18 and M > 6p + 3, let O C Cl^ 1 be the 
set obtained by Lemma 4 in Birge and Massart (2001). We have p(u),u)') > {k + l)/2 for any 
u;,u/ G 7^ cj and 

, / /e(M-l)\ fc, f eM\ 
log card(ft) > — - log > - log — , (8.18) 



2 V 8(fc - 1) / ~ 3 °\8k 
where we used the fact that 3 < k < M. Define now the set 



= |cj g {0, 1} A/ :u = (l,w),wefl}. 



We have Vl^ C fif , card(^( 2 )) = card(O) and p(u,u/) > (fc + l)/2 for any w,a/ G 0( 2 ),u/ / w. 

So far, we have fully covered M, A; such that M > 3k, k > 36. We consider now respectively 
the cases (hi) 2k < M < 3k, k > 72, (iv) k < 71, M > 2fc, and (v) M < 2/c. 

(hi) If 2k < M < 3k, k > 72, let be the integer part of k/2: k' = [k/2] > 36, and observe 
that 3k' < M' where M' = M — (k — k') < M. Therefore, we can apply the preceding results to 

M 
k> 



ensure that there exists a subset Cl of Cl¥,' such that 



log (card(SZ)J > — log 



and p(to,uj') > (A;' + l)/2 for any u,uj' € CI, ui ^ ui' . Since fc' > /c/3, we obtain 

k ( eM' \ 

log(card(0)) >-tog^— J . (8.19) 

To embed CI in fij^, define 

Cl {3) = I u G {0, 1} M : w = ( 1, . . . , 1 , w) , weO 

V fc— fc' times 

We have f}( 3 ) C Of, card(ft( 3 )) = card(fj) and p{w t uf) > (k' + l)/2 > (jfe + l)/4 for any 
w,o/ G Cl( 3 \u}' / w. 
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(iv) If k < 71, M > 2k, consider the set fiW = {wW, . . .,oj^ M / k ^} C Of , such that, for any 



1, . . . , [M/A;], the Z-th coordinate of w^') satisfies 



1 if and only if (j — 1) k + 1 < Z < jk. 



We have p{uj,u') = 2k > (k + l)/4 for any w,u/ G OS > ,oj' / w and 

log 2 



log ^card 





~M~ 


) = log ( 


k 



> 



> 



log(l + 2e) 10g l 1 + — 



k 



log 2 



711og(l + 2e) °H + / 



> 0.005felog ( 1 + — J . 



.20) 



Note that (i)-(iv) cover all M > 2k and k > 1, and in these cases k = k. We now use (8.17), 
.18) and (8.19) jointly with the following inequality 



ilogf| 



8/ - 91og(l + 3e) 



> 



log(l + x) > 0.0009 log(l 



x > 3e . 



This yields the result of the lemma for cases (i), (ii) and (iii) since in these cases M/k > 3 and 
M'/k' > 3. For case (iv) we use directly (8.20). Thus, the lemma is proved for M > 2k. 

(v) Finally, if M < 2k, or equivalently, when M — k < k, we can reproduce all the arguments 
above with k replaced by k = M — k which satisfies 2k < M. In each case, i = 1, ... ,4, we 
obtain the subsets (ffl C analogous to Q,® in (i)-(iv). They are uniquely mapped into £1 
by applying the bijection w^l-w, where 1 = (1 . . . , 1) G {0, 1} M . 
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